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CN Abstract 

A simple way to model phenotypic evolution is to assume that after splitting, 
the trait values of the sister species diverge as independent Brownian motions. 
Relying only on a prior distribution for the underlying species tree (conditioned 

C" on the number, n, of extant species) we study the random vector (X\, . . . , X n ) 

of the observed trait values. In this paper we derive compact formulae for the 

j'T'j variance of the sample mean and the mean of the sample variance for the vector 

0_l (Xi,...,X n ). 

The key ingredient of these formulae is the correlation coefficient between two 

• — trait values randomly chosen from (X%, . . . ,X n ). This interspecies correlation 

coefficient takes into account not only variation due to the random sampling of 
two species out of n and the stochastic nature of Brownian motion but also the 

l— - 1 uncertainty in the phylogcnctic tree. The latter is modeled by a (supercritical 

or critical) conditioned branching process. In the critical case we modify the 
Aldous-Popovic model by assuming a proper prior for the time of origin. 

Keywords: Phylogenetic comparative methods, Birth and death process, 
Conditioned branching process, Branching Brownian motion, Uncertainty in 
phylogeny 
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1. Introduction 

A simple way to model phenotypic evolution for n related species is to as- 
sume that after splitting, the trait values (e.g. the logarithms of body sizes) 



X 

of the sister species diverge as independent Brownian motions (see Felsenstein 



19851. The resulting collection (Xl, . . . , X n ) of the tip species' trait values has 
a dependence structure caused by shared phylogeny. In this paper we derive 
compact formulae for the variance of the sample mean X — n (X\ + . . . +X n ) 
and the mean of the sample variance 5 2 = (n — 1) _1 Y17=i(Xi — X) 2 . These 
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formulae take into account not only the stochastic nature of Brownian motion 
but also uncertainty in the phylogenetic tree. 

Based on observed tip species data one would like to make statements about 
the stochastic process of evolution like the ancestral state X at the time of 
origin T and infinitesimal variance a 2 of the Brownian motion. These are im- 
portant questions addressed by phylogenetic comparative methods. Usually 
this sort of inference attempts to incorporate the knowledge of the phylogenetic 



tree estimated from independent data (Butler and King 2004 Hansen et al 



2008 Bartoszek et al. in review I . There is however uncertainty attached to the 



estimated tree which should be somehow reflected in any subsequent analysis. 

All currently available methods addressing such statistical issues rely on 
simulations. Pagel and Lutzoni (20011 and Huelsenbeck and Rannala (20031 



propose to use an MCMC approach to generate a sample of plausible phyloge- 
netic trees each one with its posterior probability attached as a weight. |Butler 
and King (2004) do not include phylogeny uncertainty in their OUCH R (|R 



Development Core Team 20101 package but say that in can be incorporated 



in their framework, if one can compute likelihood values (e.g. posterior prob- 
abilities from a Bayesian estimation procedure) for candidate trees. Then the 
complete likelihood function is a product of the tree's likelihood and the like- 
lihood conditional on the tree and comparative data. A Bayesian estimation 
procedure implemented by Lemey et al. (2010) uses a tree rescaling step, with 
each branch of the phylogeny being independently rescaled by an appropriately 
(e.g. gamma or log-normal) distributed random variable. 

These methods face a number of common challenges. The first one is com- 
putational, as estimating a phylogeny can be computationally extremely de- 
manding. The second is interpretational, whilst the weighing of results is fully 
justified statistically one could raise biological objections whether the result is 
actually biologically meaningful for all parameters of the assumed model of trait 
evolution. An extreme hypothetical example is if we would have two competing 
phylogenies each with equal likelihood. The first results in a regression slope 
of 1, the second — 1. The average of them is 0. A regression slope of means 
that there is no relationship between the two variables while both phylogenies 
indicate that there is a relationship except that we don't have strong enough 
evolutionary data to decide about the direction of this relationship. The third 
problem is that since we are merely "trying out" different possible phylogenies 
we always run the risk of not considering the ones close to the true one. 

Here we propose a different approach making use of explicit analytical cal- 
culations. We model the unknown phylogenetic tree for n extant species using 
a conditioned birth-death process with speciation rate A and extinction rate \i 
as described by Gernhard (2008). The corresponding distribution of random 
trees with n tips is a posterior distribution resulting from the improper uniform 
prior on the time of origin T. The appropriate range of the rates < fi < A has 



an important region /i = A representing the critical case (Aldous and Popovic 
|2005[ ) with the speciation and extinction events being equally likely. In the su- 
percritical case /i < A the height of the tree is expected to be lower due to the 
expansive speciation regime. A key test example of the supercritical birth-death 
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Figure 1: A branching Brownian motion simulated on a random tree with n = 5 tips using the 
TreeSim and mvSLOUCH software. Panel A: the trait evolution for five species is modeled by 
a Brownian motion with a = 1. Panel B: the species tree disregarding the trait values. Panel 
C: a convenient presentation of speciation times. 



model is the classical Yule model ( Yule 1924 ) of pure birth process when p = 



2. Summary of main results 

In our setting both the variance of the sample mean 

Var [X n ] = ^n-^l + (n - l)p n ) E [T] (1) 
and the mean of the sample variance 

E[S 2 n ] = a\l-p n )E[T], (2) 
are compactly expressed (see |Appendix A| in terms of the correlation coefficient 

1 



J)Var[X] 1 



J2 Cov[Xi,Xj] (3) 



^27 L J l<i<j<n 

and the mean time to the origin E [T] . 

Sections |3j [4j [5] present analytical formulae for p n and E [T] in the Yule, 
supercritical and critical cases. These formulae are summarized in Tab. [T] in 
terms of three principal cases for the species tree model. Observe that we incur 
no loss of generality by specifying one of the two parameters (p, A). For example, 



in a seemingly more general case with < p < A the same formula, Eq. (12 1 
holds with A replaced by the ratio X/ p. 

What we call the interspecies correlation coefficient p n is the correlation be- 
tween two trait values randomly chosen among n observed. Next, to clarify the 
exact meaning of p n we describe an algorithm producing a pair of random vari- 
ables having p n as the correlation coefficient for a given set of parameters (n, A). 
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Table 1: Summary of formulae for p n and E [T]. 



Algorithm 1 Generate two random variables with a correlation of p n 



generate a species tree with n tips using TreeSim (Stadler 2009 2011) 



generate n trait values by running a branching Brownian motion over the 
species tree simulated in step 1 using mvSLOUCH (Bartoszek et al. [m] 



review 



3: choose at random two out of n trait values generated in step 2. 



The steps 1-3 of Algorithm [T] (implemented by us in R) were repeated many 
times to collect enough data for estimating the correlation coefficient between 
the underlying pair of random variables, see Fig. [2j The simulation results 
presented in Tab. [2] compare the correlation coefficient estimated from the sim- 
ulated trees p n to the true value of p n and the value given by an appropriate 
approximate formula. Notice that we did not simulate the critical case with a 
proper prior as suitable software is currently lacking. Simulations of the criti- 
cal case with improper prior are time consuming. Therefore the critical case is 
represented with a smaller number of dots on the graph. 

In the critical case the correlation coefficient p n is undefined as both the 
covariance between two sampled species and the species' variances take infinite 
values. We overcome this difficulty by modifying the Aldous-Popovic approach, 
we replace the improper prior distribution for T by the uniform prior on a 
finite interval (0,7V). We believe that considering a proper prior in the critical 
case makes the model biologically more relevant. A realistic value of N gives an 
upper bound on the number of speciation events for a group of related species as 
traced back to their common ancestor. This number depends on the particular 
kind of organisms in consideration and in many cases cannot be larger than 
several thousands. 



Model 


n Trees p n p n Approximation 


/i = 0, A = 1 
/i = 1, A = 2 
p = l, \= 1.01 
p. = 1, A = 1 


30 1000 0.430 0.449 0.503 using ( 
30 1000 0.506 0.502 0.609 using ( 
30 1000 0.784 0.794 0.689 using ( 
10 100 0.870 NA NA 
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Table 2: Summary of simulations. 
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Figure 2: Regression line fitted to the simulated data (thick line) compared to the line y = p n x 
(dotted line) with p n given by the exact formula. Upper-left panel: the Yule case. Upper- 
right: the supercritical case with A = 2. Lower-left: the near-critical case with A = 1.01. 
Lower-right: the critical case with improper prior (here the dotted line is y = x). 
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In Section [6] the obtained formulae for p n and E [T] are combined with Eqs. 
([I]) and ([2]) to produce compact expressions for Var [X n ] an d E [S 2 ] in the three 
main cases. These analytic expressions can be used, for example, to construct 
phylogenetic confidence intervals for the ancestral trait value X , which would 
take into account tree uncertainty. This issue is one of the subjects of our 
forthcoming paper where among other things some of the results of this paper 
for the Brownian motion model are extended to the Ornstein-Uhlenbeck model. 

Section [7] presents a connection to a new measure of how balanced are phy- 
logenetic trees recently introduced by Mir et aT] ( 2012). Appendix A and 
|Appcndix B| contain intermediate results. Appendix B| is mainly dealing with 
the properties of an important for this paper expression, 



V m — 1 f— J im 1 J (n + i)m % 



(4) 



which satisfies < e n m < 



n(m— 1) 



for 1 < 771 < OO. 



3. Correlation coefficient for the Yule model 

Assume that the trait values evolve according to Brownian motions with 
variance a 2 . At the time of origin the ancestral trait is believed to have a fixed 
value Xq. Due to the formula for the total variance, the variance of a sampled 
trait value equals, 

Var[V 2 ] = E{Vax [Xi\T}]+ Var [E[Xi\T]} 
= E [a 2 T] + Var [0] = a 2 E [T] . 

If Xij is the ancestral trait value at the time of the most recent common 
ancestor for two sampled species, then 

Cov [X t , Xj] = E [Cov [Xi,Xj\T, n^X^}] + Cov [E [X t \T, Tij , X v ] , E [X,\T, t zj ,X z 
= E [0] + Var [X^] = a 2 E [T - ry] . 

Putting this into ([3| we get 

_ E [T - t] 
_ E [T] ' 

where r is the time to the most recent common ancestor for a pair of randomly 
chosen extant species. 

The denominator in Eq. ^ is computed as 



CO 

E [T] — J tq n (t)dt, 



where q n (t) is the distribution density for the time to origin T. Assuming the 
Yule model with rate A = 1 for the unknown species tree it is easy to see that 

E [T] = a n . (6) 
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Indeed, using the formula (see Gernhard 2008 1 

q n (t)=n(l-e- t ) n - 1 e-\ 
and applying a change of variables 



(7) 



v = 1 — e 



t = - ln(l - v), dv = (1 - v)dt, 



we conclude 



l l 
E [T] = -nj\n(l-v)v n ~ 1 dv = - J\n(l-v)d(v n -1) 
o o 

= I^v = a n . 
o 



In the framework of the conditioned reconstructed process model (see [Gem- 



hard 20081 the random species tree (extinct species removed) is conveniently 



described in terms of speciation times s%, . . . , s n _i, see panel C in Fig. [T] Con- 
ditioned on the time of origin T = t the random variables Si,...,s n _i are 
independent and identically distributed according to a cumulative distribution 
function to be denoted by F t (s). Due to this observation the numerator Eq. ([5| 
can be found from the formula 



E\T-t] 



2(n - k) 
n ( n - !) 



E 




F t k (s)ds q n (t)dt, 



(8) 



derived in Appendix A 



In the Yule case we have 



F t (s) 



1 - e 



^l{0<s<t} + l{s>t} 



(9) 



which together with Eq. ([8| after applying a change of variables it = 1- 



gives 



71-1 



1 V 



E[T-t] = ^Ein-Vff^-dudv 

k=l 

n-1 l l , 
= AE ff f^di;"- fc d U . 

n— 1 J J X—u 

k=l u 

Switching the integration order we find, 

ECT-r] = ^El { -^G?^du 

k=l 
n— 1 

2 / \ 2(n— a n ) 

= ^=1 E(On-°fc)= n-1 • 
fe=l 

Combining this with Eqs. ([5| and ([6]) we arrive at 

2 



n — 1 V a. 



(10) 
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Figure 3: Exact (black line) and approximate (gray line) formulae for p n in the Yule case 
(left), supercritical case A = 2 (centre) and critical case with proper prior N = 10000 (right). 



where a n = \ i s the n-th harmonic number. Notice that Eq. (10) implies 



oo, 



(11) 



Inn + 7 + o(l) ' 

where 7 = 0.577 ... is the Euler constant, in other words, — — Inn — 7—^0 as 
n — > 00. The exact formula Eq. (10) and the approximate formula Eq. (11), 



with the term o(l) being disregarded, are illustrated in Fig. |3j left panel 

4. Supercritical case 

In the supercritical case the correlation coefficient has a more complicated 
but still surprisingly compact form in terms of the function from Eq. Q, 



2 / n(l + e», A ) 
n - 1 \ a„ + e n \ 



In — 
111 A-l 



A - 1 



(12) 



Observe that Eq. (j 1 0|) can be recovered from Eq. (|12|) by letting A — ¥ 00. 

(13) 



Furthermore, Eq. (12) implies a close counterpart of Eq. (11), 

2 



Pn = 



lnn + 7-ln^j- + o(l) ' 



uniformly in A > Ao for any Ao > 1. 

Specializing on the nearly critical case, when A = 1 + 1 /N for some large N, 
we derive the following asymptotic result, 

Pn = ^-W7mTr 1 - ,x , I77V ' N^oo. (14) 



2(]nN-a n + T) + o(l) 

The fact that p n — ¥ 1 as N — > oo is a consequence of the improper prior 
distribution assumption for the time of origin T. Besides this approximation, it 
can be shown that for any fixed positive a, 



l+Ig 

In a + 7 + I a 



1 



a 



N -¥ oo, n/N -¥ a, 



(15) 
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Figure 4: Approximated correlation coefficient as a function of model parameters. Left, the 
supercritical case, Eq. ( |13[> : black A = 1.5, gray A = 2, light gray A = 5. Center, the critical 
case with a proper prior, Eq. ( |19| : black n = 50, gray n = 100, light gray n = 500, lightest 
gray n = 1000. Right: the critical case with a proper prior, Eq. pOp. 



where I a = L e , , so that e a I a = f 

" JO a+x ' " Ja 



is the exponential integral. 



The exact formula Eq. (12) and the approximate formula Eq. (13 1 are illus- 
trated in Fig. |3j central panel. Another illustration of Eq. ( 13 ) is given on Fig. 
|4j left panel. 



We derive Eq. ( 12 ) using 



Qn(t) 
Ft(s) 



n\ n (\ 1^2 (l-st) 

x s A— 1+xt 
A— l+x s xt ' 



where Xt = 1 — e ^ These expressions are obtained from more general 



relations due to Gernhard ( 2008 ) after specifying the parameter values as \i — 1 
and A > 1. Denoting S = A -1 we can write, 



E [T] = nS(X - l) s 



(l-*t) 



(1 - 5(1 - xt))^ 1 



tdt. 



A change of variables, 



1 — Xt 



u ~ l-<5(l-x t )' x a ' t ~ 1-<5d' 

(A-l)t = lni^, dv = X(l-5v)(l-v)dt, 



results in, 



E[T] 



(A-l)-V(ln^)^- 1 
o v y 



dr 



(A - I)" 1 / (ln(l - 5v) - ln(l - v)) d(v n - 1) 
o 

1 n-l 1 

fc=0 
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Applying Eqs. (B.l) and (B.2) from Appendix B leads to, 
E [T] = — !- - (a n + e„, A - In - - ^ 



(16) 



Furthermore, with u — 1 _ s ^ 1 '_ x s and du = A(l — 5u)(l — u)ds Eq. (|Sj) 
entails, 

n—l 1 v 

EpT-r] = ^iE(n-fc)JJ ( r.xi:.) dt*dt; 



Due to. 



n-l 



E 

fe=i 



u 



(l-Su)(l-u) 



dv 



k=l 

n-l 1 1 

i-l) f-< J J (l-<5«)(l-«) 
k=l u 



n—ln—k—1 n j. I ,- , n — l „ . 4 , 

m T du v— * / ju J du 



k=l 1=0 



3=1 



5u 



we arrive at, 



E[T-r] ^ ^ 

fe=i 

|R3l 



- (n-lj(A-l) ( n + ne «,A - («« + e »,A - hi j^j 

which together with Eqs. (16) and ([5| gives Eq. (12). 



5. Critical case with a proper prior 



Under the improper uniform prior on (0, oo) for T one has (see Aldous and 



Popovic 2005) 



nt n ~ 

= ( 1+f )n+l > te (°'°°) 

implying the infinite mean E [T] . To remedy this inconvenience we use a proper 
uniform prior on (0, N) and put m = N £ 1 . The corresponding posterior distri- 
bution of T has density, 



with finite mean, 
obtained as 

E[T] = 



g»(*)= (1 + f)n+1 , te(o,i\0 

E [T] = rae„, m , 



(17) 



JV 



t n dt 



(! + <)«+! 



1-a; 
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For the critical case with a proper prior we establish 

Pn = 2 



2.V ( I 



n — 1 



2JV(JV+1) ^ | «»-hi(jV+l) ^ ( . |s) 



n(n — 1) 



where m = 1 + 1/N. Interestingly, the following approximate version of Eq. 

p« = i - on AT 1 rr nv w -> ( 19 ) 

2(hJV — a n ) + o(l) 



is almost the same as Eq. ([14]). The counterpart of Eq. (15) is given by, 

In a + 7 x 



p»->2--(l + -^]+-^|J 
a \ l a J a z 



/a 



, TV — > oo, ra/iV -> a. (20) 



Eq. ( 18 ) is illustrated on the right panel of Fig. |3j while Eqs. ( 19 ) and ( 20 ) are 
illustrated on the central and right panels of Fig. |4j 

Next we derive Eq. ( 18 ) using the formula F t (s) = (j 1 ^*) obtained by 



and Popovic (2005). Entering this into Eq. (JSJ) gives, 

„ i N t 



Aldous 



E[T-r] = ^-5>-fe) 
n— 1 ^-^ 

k=l 







S \ I 1 + 1 \ 1 



1 + s 



t 



rdsdt. 



Replacing the variables s and t with y = -^- s and x = we get, 

n-l 1/™ a: 



E[T-r] 



2m™ 



=i E 1 fy k (l-y)- 2 dydx n - k 

k=l 
n-l 1/™ 

zj E / (™y) fe (l - y)" 2 (l - (my) n - k )dy 

k=l 



and then, 
E[T-r] 



^iE 



fc=l 



2ne r , 



ne n7n +n 
n— 1 V m— 1 



(m- 



a 1) 2(a n + eri,m m m m i)) 



which combined with Eq. (17) gives Eq. (18) 



6. Variance of sample mean and expectation of sample variance 

Our formulae for p n and E [T] obtained in the previous sections imply the 
following compact expressions for Var and E [S%\ thanks to Eqs. ([I]) and 

In the Yule case (/x = and A = 1) Eqs. ^ and ( |To| give 

Var[X„] =a 2 (2-^). (21) 
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Figure 5: Variance of sample mean in the Yule case given by Eq. \2l\ with points indicating 
simulated values. Each point is the estimate of the variance based on 10000 simulations for 
each value of n. In simulations Xq = and a' 2 = 1. 



In Fig. [5] we can see that the above formula and its consequence Var [X n ] — > 
2a 2 as n — > 00 agree well with simulations. Notice that this immediately implies 
that the unbiased point estimate X of the ancestral state Xq is not consistent 
as the variance of the estimator tends to a constant 2a 2 . We can compare this 
with the result of Ane ( |2008 ) who deals with another estimator of the ancestral 
state. The estimator of Ane (20081 is unbiased and converges (in L 2 and almost 
surely) to a random variable with a non-zero variance, bounded from below 
by a 2 t/k, where t is the maximum length of a branch stemming from the root 
and k is the number of branches stemming from the root (fc = 2 in our model). 
However, |Ane| ( |2008[ ) considers a different model of tree growth as n — > 00 and 
the tree is assumed to start at the root. 

Using Eqs. ^ and (21) we obtain for the Yule case 



E[S 2 



n — 1 



n — 1 



(22) 



so that E [S 2 ] ~ <j 2 Inn as n — > 00. This suggest an unbiased estimate for the 

J2 



variance a 



In the supercritical case (fi = 1 and A > 1) Eqs. (16) and (12) entail 

9 ( l+e„,> 



Var \X n 



2a 2 



A-l 



n (A-l) 2 ( fl ™ + e,l - x 1° A J 



_ 2 (^i+")K,-ln x i T )+(j A r -"K.i+l-2n 
(A-l)(n-l) 



E[S 2 ] = a 2 

In the critical case (fi = A = 1) with a proper prior imposed on the time of 
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origin we use Eqs. (17) and (18 1 to get 



Var[X„] =(T 2 ( e „, m (2n-2iV+^±il-l 



n _ 2N(N+1) _ \ , A„ lN 

J 71—1 

where m = 1 + 1/N and 

A n , N = 2N(n -(N+ l)(a n - ln(N + 1))). 

To compare different cases we put together asymptotic formulae as n — > oo 
(and additionally n/N — > a in the critical case): 

{2 in the Yule case, 
in the supercritical case, (23) 
c a n in the critical case, 

where c a — 2a~ 2 ((a 2 — a + l)I a — a + lna), and 

!lnn in the Yule case, 
^ in the supercritical case, (24) 
d a n in the critical case, 

where d a = a~ 2 ((2a — a 2 — 2)I a + a — lna). 



7. Connection with total cophenetic index 



A recent work due to Mir et al. (2012) considers a new balance index for 



phylogcnctic trees termed the total cophenetic index. The total cophenetic index 
for a given tree with n tips is defined as, 

l<i<j'<n 

the sum of the number of branches ifiij from the root to the most recent common 
ancestor of tips i and j. Their model of the phylogenetic tree is different from 
the one we discuss in that there is no branch prior to the root, i.e. the tree 
"begins" at the first branching point. Under the Yule model they show that the 
expectation of the total cophenetic index for a tree with n tips is 

E[$ n ] =n(n + l-2a„). (25) 

We next demonstrate a short proof of the latter formula based on the ap- 
proach developed in this paper. Denote by T n the time of the tree root so that 
T — T n is the length of the initial branch until the first splitting (for illustration 
see Fig. [l] panel A). For the conditional Yule tree with n tips the random 
variable 

l<i<j<n 
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is the sum of branch lengths connecting the root with the most recent common 
ancestor of tips i and j. Since the mean branch length of this random tree is 
0.5 (see Mooers et al.| in press |Stadler and Steel 2012, for results on branch 
length expectations) we have E [$„] = 2 E [$* ] , and Eq. ( 25 ) follows from 



e = e 



( T » _ Ti j) 

l<i<j<n 



E [T n 



where as before r is the time to the most recent common ancestor for a randomly 
chosen pair of tips. Indeed, using the simple fact proved in | Appendix A| 



E [T - T„] = 1, 



(26) 



we get the required equality 



(2)(E[T-r]-l) 

(S) ( 2 ^-l)=f(n + l-2a„). 
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Appendix A. 



This section contains derivation of formulae ([lj, ph, Q, and (261. 
Relations ([l} and ([2| come straightforwardly from 



n 2 Var [X n ] = Var 



and 



E[S*] 



= nVar[V] + 2 £ Cov [X f , X,-] 

l<i<j <n 

= i(l + (n-lK)Var[X] 



= ^E^£X?-(X n ) 2 

= ^ (E [X 2 ] - E [X n ] 2 ) = ^ (Var [X] - Var [X n ] ) 
= (l-p n )Var[X] 



due to the variance formula Var [X] = a 2 E [T] characterizing the Brownian 
motion model of evolution considered here. 
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Equation ([8]) is obtained as follows. If k is the corresponding distance be- 
tween the sampled tips, then 



(n = k) = ^—^, k = l,...,n-l, 

\2J 



(A.l) 



because in a row of n positions there are n — k pairs on a distance k. Since r 
is the maximum of k independent and identically distributed speciation times, 
we get, 

P(t > s\T,k) = l-F£(s), 

and Eq. Q follows from, 

T T 

E [T - t\T, k] = T - J(l - F£{s))ds = J Ff(s)ds. 



Finally, ([26) follows from and ^ 



E[T-T n ] = E 





t 

= J ne-*/(l - e- s ) n - x dsdf 

o o 
i i 

= n J J(l — v) n ~ 1 v~ 1 dvdu 



u 
1 

n f(l-v) n - 1 dv = 1. 
o 



Appendix B. 

In the main text we use the following relations for the function in Eq. Q: 



ejfe. 



f x dx 
J m—x 5 



n-1 

fc=0 
n-1 

fe=l 



m— 1 



1/m 

TO fc J yfe^ _ y)- 2 dy 



i(a n -\-e n m — In 

(m-1) 2 ! 



Equation (B.ll follows from, 



f^^rdv 

J l — vm L 



1/m 

m k+i r xdx _ 

J 1 — x 


fc+1 j m _ y« 
m— 1 ^ 
i=l 



1/m 

fe+1 f ( 1 l-x l 

J \ 1-x l-x 



da; 



meu 



(B.l) 
(B.2) 
(B.3) 
(B.4) 
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To prove Eq. (B.2| put x = Efc=o e k,m and using, 

men, m = fc _1 + e k , m , 

set up a linear equation, 



(B.5) 



mx = a n + x + e Tl 



In 



m — 1 



whose solution is Eq. (B.2|. Similarly, to obtain Eq. (B.3) put x — E&=i ^ e *v 
and use Eq. (B.5) to get a linear equation, 



^ mkek-x,m = mx + 171*^2 e k-i,m = n + x + ne niTn , 



fc=i 



fe=i 



(B.l) and (B.5) as, 



which in view of Eq. (B.2) gives Eq. (B.3). Equation (B.4) follows from Eqs 



1 /m 1/m 

m k J y k (l-y)~My = m k J - dy k 

o o v ' 

m— 1 J 1—2/ 

PROOF of Eqs. ([14]) and (Jl9). Let us write o m instead of O f(m - l) 2 In 
asm jl. Observe that, 



a n + e 



In 



m — 1 



Thus as m 4- 1, 



m — 1 



since, 



* — ' V i % I m — 1 

t=i v 7 



n-1 \ 71 E > 

V] m l In - V 

^ / m— 1 * 

i=0 / i=l 



/n-1 

/ 



E 



(m-l) E 3 
n—i i j =1 



0„ 



+ (2) + ("*-l) + o„ 



E 



E j 

3=1 



ra-2 
= 1 



E(n— i— l)(n— i) 
2i 
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It follows, 



1 -j- G n ^ m 

n(l + e n , m ) 



1 - a n + In ^) (1 + n(m - 1)) + o„ 



rn — 1 



1 

rn — 1 



(2) (ta^-on + f) (m-i) + 0m , 



and 



2(l+e„, m ) _ 2m(g„+e„ |m -ln ^rr) 
(n— l)(m— 1) n(n— l)(m— l) 2 

2 (l - Or, 



In - In 

m — 1 / m— 1 



3 , ^ 

2 T m-l 



Combining these results we find that Eq. (18 1 indeed implies Eq. (19): 



2(l+e n , m ) 2m(n„+e nm -ln^ I ) 
(n-l)(m-l) n(n-l)(m-l) 2 

1/2 

l n m "l 1 — Qti+o(1) ■ 



Equation (14 1 is derived from Eq. (12 1 in a similar way. 



PROOF of Eqs. (151 and (20 1. Equations (15) and (20) are easily ob 



tained from Eqs. (12) and (18) using the following integral approximation for 
the function in Eq. (4]), 



E 



•f (n + ijrri 1 J a + x 

1 n 



c da; 



, n — > 00, n(m — 1) — > a 



for a given positive a. The last convergence follows from a Riemann sum rep- 
resentation, 



y - — — = y sf us) . 

4^ (n + i)ra % ^ J y ' 



where S 



1 and 



/(*) 



-a:/ (m— 1) 



n(m — 1) + .t a + a; 



We can recognize that e n ^ m converges to a transformation of the exponential 
integral namely, 



-dx, n — > 00, n(m — 1) — > a. 
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The previously presented formulae for e„, m are not suitable for numerically 
calculating its value but Graham Jones pointed out in personal correspondence 
that by a change of variables 



which is well suited for computation. Alternatively, as again pointed out by 
Graham Jones, in Eq. Q one can directly bound the tail (sum of terms from 
some K ) of the infinite series by m 1 ~ K " /((K + n){m — 1)). 
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